1. INTRODUCTION

Transliteration from a source script to a target script is the writing of a text in the letters of the
target script without changing the language of the text. It preserves the pronunciation of a word while
transforming it from the source script to the target script [1]. Transliteration from different languages
to English is useful in bilingual knowledge extraction tasks, including information retrieval, named
entity recognition and automatic bilingual dictionary compilation [2]-[4]. In cross-lingual tasks,
out-of-vocabulary (OOV) words such as names and acronyms are transcribed into the base document language
when the source and target do not share an alphabet. Named entity transliteration plays a significant
role in cross-language tasks: during document translation from the source to the target language, named
entities are transliterated. Transliteration is a subtask of translation, as it transforms the script of
one language into phonetically similar characters of the target alphabet, and it poses several challenges
due to differences in syntax, morphology, and semantics between the source and target languages. Hindi to
English transliteration, and vice versa, is particularly challenging due to the morphologically rich
nature of Hindi. For example, the Hindi word चाभी has multiple English transliterations: chabhi, chaabhi,
chaabhee, and chaabhie. Back transliteration is even more challenging, as several words transliterate
into a single target word. In this work, we employ a neural framework for transliteration, namely
sequence-to-sequence modelling based on recurrent neural networks (RNN), which are popular in a wide
range of tasks such as text summarization, machine translation, and named entity recognition [5], [6].

Transliteration has long been studied. It was originally modelled as a probabilistic finite-state
transducer [7], which was subsequently improved using phonetic and morphological models by [8].
Chinese tokens were generated by mapping English names to phonemes and then mapping each phoneme to 
the corresponding token [9]. An English-Russian transliteration system based on weighted finite-state 
transducer techniques and hidden Markov models was developed in [10]. A system for parallel Wikipedia 
titles in English to Tamil, English to Hindi, English to Arabic, and English to Russian using generative 
reinforcement models to produce mappings between source and target alphabet sequences is developed in 
[11]. A machine transliteration system for Bengali to English which relied upon the mapping of alphabets for 
each pair of Bengali-English phonemic mappings was designed in [12]. A phrase-based statistical machine 
translation model for English to Devanagari transliteration was proposed in [5], in which two distinct
statistical systems were developed with MOSES and Stanford Phrasal using an English-Hindi parallel
corpus. Several researchers employed machine learning approaches. For instance, [13] proposed the
transliteration of Marathi to English and Hindi to English named entities by segmenting the source tokens
into phonetic tokens and applying support vector machines. A technique for Latin-to-Balinese script
transliteration in a mobile application is presented in [14]. Conditional random fields (CRFs) for
Hindi-English transliteration in cross-language information retrieval are suggested in [15]. CRFs are
also applied in a subword-based approach to transliteration from English to Indic languages (Hindi,
Kannada and Tamil) [16]. CRFs for English to Korean transliteration and for Hindi-English names are
suggested in [17]. A transliteration scheme for the English to Hindi language pair from the NEWS 2009
transliteration task dataset is given in [18]. The methodology incorporated English and Hindi contextual
information to calculate the probabilities, chose the candidate with the maximum probability, and
further improved the algorithm by applying post-processing rules. Josan and Kaur [19] suggested
transliteration techniques for Punjabi-Hindi with respect to the
Gurumukhi-Devanagari scripts by integrating the character level alignment from source vocabulary to target 
alphabets with statistical techniques. English to Chinese transliteration used a stack of convolutional network 
layers with a basic recurrent network layer on top, which produced promising output but still fell short of the 
phrase-based system of statistical machine translation [20]. Neural machine transliteration has recently
gained importance due to advancements in deep learning techniques [21]-[26].

Deep neural networks (DNN) have proved quite successful in several language processing tasks; however,
little work is found in the literature on Hindi to English transliteration. Therefore, we investigate the
effectiveness of deep learning models in transliteration by using an encoder-decoder based
sequence-to-sequence model. As a preliminary step, we chose gated recurrent unit (GRU) networks as the
basic element of the encoder and decoder. The proposed neural machine transliteration framework, which is
essentially an encoder-decoder framework, can produce more accurate transliterations than statistical
systems by capturing the context of the source. The encoder converts the source word into a latent
variable that holds the meaningful information, which is subsequently processed by the decoder to produce
the transliterated word. The encoder and decoder are stacks of successive GRU layers on top of an input
layer that handles the representation of the transliteration tokens, which are individual characters.
Character-level models are found to be more successful in sequence-to-sequence models. Therefore, we
chose characters instead of words as the atomic elements of the whole transliteration process. The
contributions of this work are: i) we experimentally evaluate the sequence-to-sequence neural
architecture for Hindi to English transliteration and vice versa using a parallel transliteration corpus,
and ii) we present empirical results comparing one, two and three layers of GRU architecture for the same
source and target scripts.


2. RESEARCH METHOD 
2.1. Corpus 

For the transliteration task, we adopted the Hindi transliteration dataset of [27], which contains 83,697
Hindi-English transliteration pairs. No multiword entries are present in the dataset. Corpus statistics
are shown in Table 1.


Table 1. Corpus statistics

                         Hindi     English
Maximum word size        25        28
Average word size        7.96      7.58
Total number of words    83,697    83,697

2.2. Proposed method

Sequence-to-sequence neural network modelling is a prominent technique based on predicting an output
sequence from its corresponding input sequence [28]. Transliteration can be viewed as analogous to the
sequence-to-sequence neural translation model; moreover, it is a subtask of language translation that is
essential for handling named entities, which do not require translation [29]. This transliteration model
is based on the encoder-decoder methodology, which works well in many sequence-to-sequence applications
[30], [31]. There are two primary components, the encoder and the decoder, each of which is a sequence of
connected layers. The encoder maps the input text to a fixed-size vector, which summarizes the source
text, and this vector is given to the decoder to predict the sequence of generated characters. Both the
encoder and the decoder convert input words into vectors of floating-point numbers in two steps. In the
first step, the text is converted into integer tokens; in the second step, these tokens are converted
into a matrix of floating-point numbers with the help of an embedding layer. The overall transliteration
process using the deep learning encoder-decoder method is illustrated in Figure 1. Yao et al. [32] have
argued that the encoder-decoder architecture is appropriate for general sequence-to-sequence models. The
key principle is to map the entire input sequence to a vector and stack GRU layers to produce an output
sequence based on the encoded vector.


[Figure 1 diagram: Input text -> Encoder -> Context representation -> Decoder -> Output transliterated text]

Figure 1. Transliteration using encoder decoder method


2.3. Gated recurrent units 

Gated recurrent units (GRU) [33] are a type of recurrent neural network (RNN) with a dedicated mechanism
for resetting and updating the hidden state, achieved using the reset gate and the update gate,
respectively. The reset gate controls how much of the previous state is retained when computing the
candidate state. Likewise, the update gate controls how much of the new state is carried over from the
old state. The two gates are given by (1) and (2). Among the RNN variants, the GRU is more streamlined
and offers faster computation with a simpler model [33].


$$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r) \tag{1}$$
$$Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z) \tag{2}$$


where $W$ and $b$ denote weights and biases respectively, and $\sigma$ is the sigmoid function. The
output of the reset gate is integrated with the previous hidden state to obtain the intermediate current
hidden state $\tilde{H}_t$, which is further integrated with the update gate to obtain the final hidden
state $H_t$, as shown in:


$$H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t \tag{3}$$

where the intermediate hidden state is given as:

$$\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h) \tag{4}$$


From (3) it can be concluded that within each GRU the short-term dependencies are controlled by the reset
gates, whereas the long-term dependencies are controlled by the update gates.
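
As an illustration, a single GRU time step following (1)-(4) might be sketched in NumPy as below. This is a minimal sketch rather than the implementation used in the experiments: the helper names (`gru_step`, `init_params`), the layer sizes and the weight initialization are assumptions made only for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(X_t, H_prev, params):
    """One GRU time step; comments refer to the equation numbers in the text."""
    W_xr, W_hr, b_r, W_xz, W_hz, b_z, W_xh, W_hh, b_h = params
    R_t = sigmoid(X_t @ W_xr + H_prev @ W_hr + b_r)              # reset gate (1)
    Z_t = sigmoid(X_t @ W_xz + H_prev @ W_hz + b_z)              # update gate (2)
    H_cand = np.tanh(X_t @ W_xh + (R_t * H_prev) @ W_hh + b_h)   # candidate state (4)
    return Z_t * H_prev + (1.0 - Z_t) * H_cand                   # new hidden state (3)

def init_params(num_inputs, num_hiddens, rng):
    """Hypothetical initializer: small random weights, zero biases."""
    def triple():
        return (rng.normal(0.0, 0.01, (num_inputs, num_hiddens)),
                rng.normal(0.0, 0.01, (num_hiddens, num_hiddens)),
                np.zeros(num_hiddens))
    return triple() + triple() + triple()

rng = np.random.default_rng(0)
params = init_params(num_inputs=8, num_hiddens=16, rng=rng)
X_t = rng.normal(size=(4, 8))      # a batch of 4 embedded characters
H_prev = np.zeros((4, 16))         # initial hidden state
H_t = gru_step(X_t, H_prev, params)
```

When the update gate saturates at 1, the step returns the previous hidden state unchanged, which is how the unit carries information over long spans.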


2.4. Proposed transliteration model 

The information required by the transliteration system is contained in the words, which are composed of
characters. Moreover, each character contributes in a different manner. All the characters in Hindi
contribute to the pronunciation, whereas in English some are silent. Therefore, in the input and output
layers, we tokenize the word to capture all of its characters. Words written in Hindi are morphologically
rich and orthographically complex. Each phoneme in Hindi may be weakly represented with a single
character, whereas English phonemes are sometimes composed of multiple letters. For instance, when चाभी
is transliterated to chabhi, च corresponds to ch and भ to bh. Hence, character-level misalignments are
abundant in Hindi to English transliteration. In our work, we consider each character as an individual
unit for transliteration. Each input and output word pair is therefore tokenized into character-level
units instead of phonetic units. A dictionary is generated by assigning the highest integer value to the
character with the highest frequency in the whole corpus. This dictionary is generated for both Hindi and
English characters. Subsequently, each character unit is encoded with an integer value and, finally, the
length of the encoded word is normalized with padding. The post-processing technique adopted is the
inverse of the pre-processing. After encoding, the vectors are obtained for each input word;
subsequently, Hindi words are passed into the encoder and decoder, whereas English words are passed into
the decoder. Due to the variation in word length, beginning and end markers are inserted at the input and
output vectors respectively, to align the vectors to the same length. Therefore, the size of the Hindi
input vector obtained is 25 and the size of the output vector obtained is 28.
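
The pre-processing described above could be sketched as follows. This is only an illustrative sketch under stated assumptions: the padding index 0, the tie-breaking rule and the function names are not given in the text, and the beginning and end markers are left out for brevity.

```python
from collections import Counter

PAD = 0  # assumed padding index; the text does not specify one

def build_vocab(words):
    """Map characters to integers so that more frequent characters get larger values."""
    counts = Counter(ch for word in words for ch in word)
    # Ascending frequency (ties broken alphabetically, an arbitrary choice),
    # so the most frequent character receives the highest integer.
    ordered = sorted(counts, key=lambda ch: (counts[ch], ch))
    return {ch: i for i, ch in enumerate(ordered, start=1)}

def encode(word, vocab, max_len):
    """Encode a word as integers and pad it on the right to max_len."""
    ids = [vocab[ch] for ch in word]
    return ids + [PAD] * (max_len - len(ids))

def decode(ids, vocab):
    """Inverse of encode: drop padding and map integers back to characters."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids if i != PAD)

english = ["chabhi", "chaabhi", "chaabhee"]
vocab = build_vocab(english)
encoded = [encode(w, vocab, max_len=28) for w in english]  # 28 = English vector size
```

Decoding an encoded word returns the original string, which mirrors the statement that post-processing is the inverse of pre-processing.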

The encoder with GRU layers accepts a variable-length sequence as the source and converts it into a
hidden state of fixed dimension, following the design hypothesis of the encoder-decoder. In other words,
the hidden state of the recurrent encoder captures the input sequence information. A decoder with the
same number of GRU layers predicts the next token, producing the target sequence token by token based on
the characters seen so far, alongside the recorded source sequence information. Figure 2 shows how
multiple GRU layers are used in transliteration for sequence-to-sequence training. The encoder's function
is to convert a variable-length source sequence into a context representation variable c of fixed shape
and to compress the source sequence information into it.


[Figure 2 diagram. Labels: Input text, Output text, Embedding Layer (encoder and decoder), GRU Layers
(encoder and decoder), Context Representation, Fully Connected Layer, Output text]

Figure 2. Proposed transliteration model


Let $x_1, \ldots, x_T$ be the source sequence, where $x_t$ is the $t$-th token in the source sequence.
The recurrence converts the input vector $x_t$, along with the preceding hidden state $H_{t-1}$, into the
present hidden state $H_t$ at time step $t$. The encoder converts the hidden states at all time steps
into the context representation using a specialised function $q$:


$$c = q(H_1, \ldots, H_T) \tag{5}$$


In the simplest choice of $q$, the context variable is just the final hidden state of the source sequence
at the last time step:


$$q(H_1, \ldots, H_T) = H_T \tag{6}$$

In our case, we focused on the bidirectional GRU, in which the hidden state at each time step depends on
the subsequences before and after it, including the input at the present time step, so that it encodes
the entire sequence. Note that, in order to acquire the vector representation of each token in the source
sequence, we use an embedding layer. The embedding layer is essentially a weight matrix whose number of
rows is equal to the size of the source vocabulary and whose number of columns is equal to the length of
the context vector. For a given input token index $i$, the embedding layer fetches the $i$-th row of its
weight matrix and returns it as the feature vector. The context variable $c$ output by the encoder
encodes the complete source sequence $x_1, \ldots, x_T$. Similarly, for a given target sequence
$y_1, \ldots, y_{T'}$, at every time step $t'$ the conditional probability of the decoder output
$y_{t'}$, given the preceding target subsequence $y_1, \ldots, y_{t'-1}$ and the context variable $c$, is
given as:

$$P(y_{t'} \mid y_1, \ldots, y_{t'-1}, c) \tag{7}$$

It may be noted that the source and target sequences are of different lengths, so we use $t'$ for the
time step of the output sequence to differentiate it from the time step $t$ of the input sequence in the
encoder.
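
The embedding lookup mentioned in this section amounts to selecting rows of a weight matrix. The sketch below uses illustrative sizes and random weights, which are assumptions for the example; in the model the matrix would be learned during training.

```python
import numpy as np

vocab_size, embed_dim = 10, 4          # illustrative sizes, not the paper's values
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, embed_dim))   # one row per token index

token_ids = np.array([4, 6, 5, 3, 6, 2])   # an integer-encoded word
features = embedding[token_ids]            # row lookup, shape (6, 4)

# The lookup is equivalent to multiplying one-hot vectors by the weight matrix.
one_hot = np.eye(vocab_size)[token_ids]
same = np.allclose(one_hot @ embedding, features)
```

Repeated token indices (here the two occurrences of 6) receive identical feature vectors, since each index always selects the same row.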

We use another GRU as the decoder in order to model this conditional probability over sequences. At any
time step $t'$ of the output sequence, the GRU takes the output $y_{t'-1}$ from the previous time step
and the context vector $c$ as its input, and combines them with the preceding hidden state $s_{t'-1}$ to
obtain the current hidden state $s_{t'}$. Consequently, the hidden layer transformation of the decoder
may be expressed as:


$$s_{t'} = g(y_{t'-1}, c, s_{t'-1}) \tag{8}$$


We explicitly use the final hidden state of the last encoder layer to initialize the decoder state. This
requires the encoder and decoder GRU layers to have the same number of hidden units and layers. The
context representation variable is added to the decoder input at every time step to further reinforce the
encoded input sequence information. Subsequently, a fully connected layer transforms the hidden state of
the last decoder layer to produce the likelihood of each target token. The decoder predicts a probability
distribution over the target tokens at each time step. To obtain the distribution, we apply softmax and
measure the cross-entropy loss for optimization. Padding tokens are appended to the end of each sequence
so that sequences of different lengths can be batched in the same shape. The predictions for the padded
tokens, however, must be omitted.
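
One common way to omit the padded positions is to mask them out of the softmax cross-entropy loss. The following NumPy sketch illustrates the idea; the function name, the tensor shapes and the use of explicit valid lengths are assumptions for illustration and are not taken from the experiments.

```python
import numpy as np

def masked_softmax_cross_entropy(logits, targets, valid_len):
    """Mean cross-entropy over non-padded positions.

    logits:    (batch, steps, vocab) scores from the fully connected layer
    targets:   (batch, steps) integer target tokens
    valid_len: (batch,) number of real (non-padding) tokens per sequence
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-probability assigned to each target token.
    token_loss = -np.take_along_axis(log_probs, targets[..., None], axis=-1).squeeze(-1)
    # Mask is 1 for real tokens and 0 for padded positions.
    steps = logits.shape[1]
    mask = (np.arange(steps)[None, :] < valid_len[:, None]).astype(logits.dtype)
    return (token_loss * mask).sum() / mask.sum()

logits = np.zeros((2, 4, 5))                        # uniform predictions over 5 tokens
targets = np.array([[1, 2, 0, 0], [3, 4, 1, 0]])    # 0 marks padding here
valid_len = np.array([2, 3])
loss = masked_softmax_cross_entropy(logits, targets, valid_len)
```

Because of the mask, changing the scores at padded positions leaves the loss unchanged.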


3. RESULTS AND DISCUSSION 

Using appropriate neural network parameters, the models are trained and tested with a 10:1 split ratio,
and the validation accuracies are recorded. The best accuracy obtained among all the models is 92.3%. The
metrics used to evaluate the predicted output are character error rate (CER) and word error rate (WER).
The CER is defined as the fraction of incorrectly predicted characters among the total number of true
characters, averaged over all the test words. The WER is defined as the fraction of incorrectly predicted
words among the total test words. Table 2 presents the CER and WER for the three Hindi to English
transliteration models and the three English to Hindi transliteration models. The results show that
increasing the number of GRU layers improves both CER and WER. For Hindi to English transliteration, one
and two additional layers improve the CER by 12.3% and 20.2% respectively, and the WER by 1.3% and 6.2%
respectively, relative to the single-layer model. For English to Hindi transliteration, one and two
additional layers improve the CER by 13.3% and 26.1% respectively, and the WER by 5.6% and 7.7%
respectively, relative to the single-layer model.


Table 2. Evaluation results of proposed transliteration models

Number of recurrent layers    Hindi to English      English to Hindi
                              CER      WER          CER      WER
1                             0.253    0.691        0.398    0.848
2                             0.222    0.682        0.345    0.801
3                             0.201    0.648        0.294    0.783
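
For concreteness, the metrics and the relative improvements could be computed along the following lines. This sketch reads CER as the share of reference characters predicted incorrectly and compares characters position by position, which is an assumption; many evaluations use edit distance instead, and the exact procedure used for Table 2 is not specified. The function names are hypothetical.

```python
def char_error_rate(predicted, reference):
    """Fraction of reference characters predicted incorrectly (position-wise)."""
    mismatches = sum(p != r for p, r in zip(predicted, reference))
    # Missing or extra characters relative to the reference also count as errors.
    mismatches += abs(len(predicted) - len(reference))
    return mismatches / len(reference)

def evaluate(predictions, references):
    """Return (CER averaged over words, WER as share of words not matched exactly)."""
    n = len(references)
    cer = sum(char_error_rate(p, r) for p, r in zip(predictions, references)) / n
    wer = sum(p != r for p, r in zip(predictions, references)) / n
    return cer, wer

def relative_reduction(baseline, value):
    """Relative improvement of an error rate with respect to a baseline."""
    return (baseline - value) / baseline

cer, wer = evaluate(["chabhi", "chabi"], ["chabhi", "chabhi"])

# Hindi to English CER, one layer versus two layers (values from Table 2).
improvement = round(relative_reduction(0.253, 0.222) * 100, 1)  # 12.3
```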


We compare the proposed models with recent existing works on Hindi to English transliteration in Table 3.
The Hindi to English transliteration works of [13], [15], and [17] have reported only the model
validation accuracy, which is lower than that of the proposed work. We also observe better performance
when compared with the Arabic to English transliteration of [4], [29]. However, the Chinese to English
transliteration of [20] reports 3.1% lower, and the Arabic to English transliteration of [21] reports
1.6% lower, CER than the proposed model. Moreover, [24] reports a top-1 transliteration accuracy of
53.3%, which is equivalent to WER and is lower in this case.

We initially developed the models for Hindi to English transliteration; however, we also evaluated them
with English as the source and Hindi as the target language. The comparison with existing works is
presented in Table 4. This model shows better results than only some of the existing works. In
particular, its CER is better than that of the English to Vietnamese model of [25].


Hindi to English transliteration using multilayer gated recurrent units (Mohd Zeeshan Ansari) 


ISSN: 2502-4752


Table 3. Comparison of proposed model with existing works on Hindi to English transliteration

Technique                          Language pairs        Efficiency rate
Cross connected multi-layer GRU    Hindi to English      ACC: 81.6%; WER: 64.8%; CER: 20.1%
Orthographic similarity [23]       Tamil to English      ACC: 53.3%
GRU [4]                            Arabic to English     WER: 81%
CNN + RNN [20]                     Chinese to English    CER: 16.2%
GRU + Attention [29]               Arabic to English     WER: 77.1%
CRF [17]                           Hindi to English      ACC: 83.98%
SVM [13]                           Hindi to English      86.52
DBN [21]                           Arabic to English     CER: 22.7%
HMM [15]                           Hindi to English      72.1%

Table 4. Comparison of the proposed model with existing works on English to Hindi transliteration

Technique                          Language pairs           Efficiency rate
Cross connected multi-layer GRU    English to Hindi         ACC: 70.6%; WER: 78.3%; CER: 29.4%
Grapheme-phoneme [23]              English to Kannada       ACC: 85.93%
GRU + BiGRU [30]                   English to Arabizi       ACC: 80.6%
LSTM + Attention [25]              English to Vietnamese    CER: 32.4%
GRU + Attention [30]               English to Arabic        WER: 65.1%
CNN + RNN [20]                     English to Chinese       ACC: 28.1%
Graph Reinforcement [11]           English to Hindi         F1: 93%
MEMM + Alignment [21]              English to Persian       ACC: 58.4%
HMM + WFST [11]                    English to Russian       61%
CRF [16]                           English to Hindi         41.8%


4. CONCLUSION 

We specifically prepare models for Hindi to English transliteration, which is rarely addressed in the
literature. The models are developed using a sequence-to-sequence neural network with an underlying
encoder-decoder methodology. GRUs are used as the recurrent units due to their simplicity and faster
performance. The encoder translates the input text into an intermediate representation, which is given to
the decoder, which maps it to the output text. The character-level approach is used for the input and
output representation to capture subword-level information. Different variants of the models are built
with single and multiple GRU layers, and the results are recorded. A maximum improvement of 20.2% in CER
over the base model is observed in Hindi to English transliteration. The performance improves as the
number of GRU layers increases, but at the cost of increased training time due to the larger number of
parameters. We compare our work with existing Hindi to English, Arabic to English and Chinese to English
transliteration models and observe that our model outperforms all of them, with a CER of 20.1% and a
validation accuracy of 81.6%, except for Chinese to English, for which the CER is 16.2%. We also apply
the same model to develop the English to Hindi transliteration model by exchanging the input and output.
However, its test results are not as good as those of the Hindi to English transliteration model.